Abstract

We focus on the high dimensional linear regression $Y \sim \mathcal{N}(X\beta^*, \sigma^2 I_n)$,
where $\beta^* \in \mathbb{R}^p$ is the parameter of interest. In this setting, several estimators
such as the LASSO [Tib96] and the Dantzig Selector [CT07] are known to satisfy
interesting properties whenever the vector $\beta^*$ is sparse. Interestingly, both
the LASSO and the Dantzig Selector can be seen as orthogonal projections
of $0$ onto $\mathcal{DC}(s) = \{\beta \in \mathbb{R}^p, \|X'(Y - X\beta)\|_\infty \le s\}$, using an $\ell_1$ distance for
the Dantzig Selector and an $\ell_2$ distance for the LASSO. For a well chosen $s > 0$, this set is
actually a confidence region for $\beta^*$. In this paper, we investigate the properties
of estimators defined as projections on $\mathcal{DC}(s)$ using general distances. We
prove that the obtained estimators satisfy oracle properties close to those
of the LASSO and the Dantzig Selector. On top of that, it turns out that these
estimators can be tuned to exploit a different sparsity and/or slightly different
estimation objectives.

Keywords: High-dimensional data, LASSO, Restricted eigenvalue assumption,
Sparsity, Variable selection.

1 Introduction

In many modern applications, one has to deal with very large datasets. Regression
problems may involve a large number of covariates, possibly larger than the sample
size. In this situation, a major issue lies in dimension reduction, which can be
performed through the selection of a small number of relevant covariates. For this
purpose, numerous regression methods have been proposed in the literature, ranging
from the classical information criteria such as $C_p$, AIC and BIC to the more recent
regularization-based techniques such as the $\ell_1$ penalized least squares estimator,
known as the LASSO [Tib96], and the Dantzig selector [CT07]. These $\ell_1$-regularized
regression methods have recently witnessed several developments due to the attractive
feature of computational feasibility, even for high dimensional data when the
number of covariates $p$ is large.

Consider the linear regression model 

$$Y = X\beta^* + \varepsilon, \tag{1}$$

where $Y$ is a vector in $\mathbb{R}^n$, $\beta^* \in \mathbb{R}^p$ is the parameter vector, $X$ is an $n \times p$ real-valued
matrix with possibly much fewer rows than columns, $n \ll p$, and $\varepsilon$ is a random noise
vector in $\mathbb{R}^n$. Here, for the sake of simplicity, we will assume that $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$. Let
$\mathbb{P}$ denote the probability distribution of $Y$ in this setting. Moreover, we assume that
the matrix $X$ is normalized in such a way that $X'X/n$ has only 1 on its diagonal. The
analysis of regularized regression methods for high dimensional data usually involves
a sparsity assumption on $\beta^*$ through the sparsity index $\|\beta^*\|_0 = \sum_{j=1}^p I(\beta^*_j \neq 0)$,
where $I(\cdot)$ is the indicator function. For any $q \ge 1$, $d \ge 0$ and $a \in \mathbb{R}^d$, denote by
$\|a\|_q^q = \sum_{i=1}^d |a_i|^q$ and $\|a\|_\infty = \max_{1 \le i \le d} |a_i|$ the $\ell_q$ and the $\ell_\infty$ norms respectively.
When the design matrix $X$ is normalized, the LASSO and the Dantzig selector minimize
respectively $\|X\beta\|_2^2$ and $\|\beta\|_1$ under the constraint $\|X'(Y - X\beta)\|_\infty \le s$, where $s$
is a positive tuning parameter (see, e.g., [OPT00, Alq08] for the dual form of the LASSO).
This geometric constraint is central in the approach developed in the present paper
and we shall use it in a general perspective. Let us mention that several objectives
may be considered by the statistician when dealing with the model given by
Equation (1). Usually, three specific objectives are considered in the high-dimensional
setting (i.e., $p > n$):



Goal 1 - Prediction: The reconstruction of the signal $X\beta^*$ with the best possible
accuracy is first considered. The quality of the reconstruction with an estimator
$\hat\beta$ is often measured with the squared error $\|X\hat\beta - X\beta^*\|_2^2$. In the standard
form, results are stated as follows: under assumptions on the matrix $X$ and
with high probability, the prediction error is bounded by $C \log(p) \|\beta^*\|_0$, where $C$
is a positive constant. Such results for the prediction issue have been obtained in
[BRT09, Bun08, BTW07b] for the LASSO and in [BRT09] for the Dantzig selector.
We also refer to [Kol09a, Kol09b, MVdGB09, vdG08, DT07, CH08] for related
works with different estimators (non-quadratic loss, penalties slightly different from
$\ell_1$ and/or random design). The results obtained in the above-mentioned works are
optimal up to a logarithmic factor, as has been proved in [BTW07a]. See also [vdGB09]
for a very nice survey paper on the various conditions used to prove these results.

Goal 2 - Estimation: Another objective is for the estimator $\hat\beta$ to be close to
$\beta^*$ in terms of the $\ell_q$ distance for $q \ge 1$. The estimation bound is of the form
$C \|\beta^*\|_0 (\log(p)/n)^{q/2}$, where $C$ is a positive constant. Such results are stated for the
LASSO in [BTW07a, BTW07b] when $q = 1$, for the Dantzig selector in [CT07] when
$q = 2$, and have been generalized in [BRT09] with $1 \le q \le 2$ for both the LASSO and
the Dantzig selector.

Goal 3 - Selection: Since we consider variable selection methods, the identification
of the true support $\{j : \beta^*_j \neq 0\}$ of the vector $\beta^*$ is to be considered. One expects
that the estimator $\hat\beta$ and the true vector $\beta^*$ share the same support, at least
when $n$ grows to infinity. This is known as the variable selection consistency problem
and it has been considered for the LASSO and the Dantzig Selector in several
works [Bun08, Lou08, MB06, MY09, Wai06, ZY06].
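To make the three goals concrete, the following minimal sketch computes, for a given estimate, the prediction error of Goal 1, the $\ell_q$ estimation error of Goal 2 and the support compared in Goal 3. The function names and the toy numbers are illustrative assumptions introduced here; they do not come from the paper.

```python
def matvec(X, b):
    """Return the product X b for a matrix stored as a list of rows."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, b)) for row in X]

def prediction_error(X, beta_hat, beta_star):
    """Goal 1: squared error ||X beta_hat - X beta_star||_2^2."""
    diff = [bh - bs for bh, bs in zip(beta_hat, beta_star)]
    return sum(v * v for v in matvec(X, diff))

def estimation_error(beta_hat, beta_star, q=1.0):
    """Goal 2: l_q distance ||beta_hat - beta_star||_q for q >= 1."""
    total = sum(abs(bh - bs) ** q for bh, bs in zip(beta_hat, beta_star))
    return total ** (1.0 / q)

def support(beta, tol=0.0):
    """Goal 3: index set {j : beta_j != 0} (0-based, up to a tolerance)."""
    return {j for j, b in enumerate(beta) if abs(b) > tol}

# Toy example with made-up numbers: n = 2 observations, p = 3 covariates.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
beta_star = [2.0, 0.0, 0.0]
beta_hat = [1.5, 0.0, 0.5]

pred = prediction_error(X, beta_hat, beta_star)
est_l1 = estimation_error(beta_hat, beta_star, q=1.0)
same_support = support(beta_hat) == support(beta_star)
```

In this toy case the estimate is close to $\beta^*$ for Goals 1 and 2 but fails Goal 3, since it puts mass on a coordinate outside the true support.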



In this paper, we focus on variants of Goal 1 and Goal 2, using estimators $\hat\beta$
that also satisfy the constraint $\|X'(Y - X\hat\beta)\|_\infty \le s$. It is organized as follows. In
Section 2 we give some general geometrical considerations on the LASSO and the
Dantzig Selector that motivate the introduction of the general form of estimator:

$$\operatorname*{Argmin}_{\beta :\, \|X'(Y - X\beta)\|_\infty \le s} \|\beta\|$$

for any semi-norm $\|\cdot\|$. In Section 3 we focus on two particular cases of interest in
this family, and give some sparsity inequalities in the spirit of the ones in [BRT09].
We show that under the hypothesis that $P\beta^*$ is sparse for a known matrix $P$, we are
able to estimate $\beta^*$ properly. Finally, Section 4 is dedicated to proofs.

2 Some geometrical considerations 

Definition 2.1. Let us put, for any $s > 0$, $\mathcal{DC}(s) = \{\beta \in \mathbb{R}^p : \|X'(Y - X\beta)\|_\infty \le s\}$.

Lemma 1. For any $s > 0$, $\mathbb{P}(\beta^* \in \mathcal{DC}(s)) \ge 1 - p \exp(-s^2/(2n\sigma^2))$.

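This means that $\mathcal{DC}(s)$ is a confidence region for $\beta^*$. Moreover, note that $\mathcal{DC}(s)$
is convex and closed. Let $\|\cdot\|$ be any semi-norm in $\mathbb{R}^p$. Let $\Pi^s_{\|\cdot\|}$ denote an orthogonal
projection on $\mathcal{DC}(s)$ with respect to $\|\cdot\|$:

$$\Pi^s_{\|\cdot\|}(b) \in \operatorname*{Argmin}_{\beta \in \mathcal{DC}(s)} \|\beta - b\|.$$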

From properties of projections, we know that 

$$\beta^* \in \mathcal{DC}(s) \ \Rightarrow\ \forall b \in \mathbb{R}^p,\ \|\Pi^s_{\|\cdot\|}(b) - \beta^*\| \le \|b - \beta^*\|.$$

There is a very simple interpretation of this inequality: if $b$ is any estimator of $\beta^*$,
then, with probability at least $1 - p \exp(-s^2/(2n\sigma^2))$, $\Pi^s_{\|\cdot\|}(b)$ is a better estimator.
In order to perform shrinkage it seems natural to take $b = 0$.
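The coverage statement of Lemma 1 can be checked by simulation. The sketch below is a minimal Monte Carlo experiment under assumed toy values ($n = 20$, $p = 5$, $\sigma = 1$, a Gaussian design rescaled so that $X'X$ has $n$ on its diagonal); none of these choices come from the paper. Since $Y - X\beta^* = \varepsilon$, the event $\{\beta^* \in \mathcal{DC}(s)\}$ is $\{\|X'\varepsilon\|_\infty \le s\}$, so no value of $\beta^*$ is needed.

```python
import math
import random

def normalized_design(n, p, rng):
    """Draw a Gaussian n x p design and rescale columns so diag(X'X) = n."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    for j in range(p):
        norm_sq = sum(X[i][j] ** 2 for i in range(n))
        scale = math.sqrt(n / norm_sq)
        for i in range(n):
            X[i][j] *= scale
    return X

def coverage(X, sigma, s, n_rep, rng):
    """Monte Carlo estimate of P(beta* in DC(s)) = P(||X' eps||_inf <= s)."""
    n, p = len(X), len(X[0])
    hits = 0
    for _ in range(n_rep):
        noise = [rng.gauss(0.0, sigma) for _ in range(n)]
        v = [sum(X[i][j] * noise[i] for i in range(n)) for j in range(p)]
        if max(abs(vj) for vj in v) <= s:
            hits += 1
    return hits / n_rep

rng = random.Random(0)
n, p, sigma, level = 20, 5, 1.0, 0.1          # illustrative values
X = normalized_design(n, p, rng)
# Choose s so that the bound of Lemma 1 equals 1 - level.
s = sigma * math.sqrt(2.0 * n * math.log(p / level))
lower_bound = 1.0 - p * math.exp(-s ** 2 / (2.0 * n * sigma ** 2))
estimate = coverage(X, sigma, s, n_rep=2000, rng=rng)
```

The empirical coverage `estimate` should sit above `lower_bound`, usually with some slack, because the union bound behind Lemma 1 is not tight.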

Definition 2.2. We define our general estimator by

$$\hat\beta^s_{\|\cdot\|} = \Pi^s_{\|\cdot\|}(0) \in \operatorname*{Argmin}_{\beta \in \mathcal{DC}(s)} \|\beta\|.$$

We have the following examples: 

1. for $\|\cdot\| = \|\cdot\|_1$, we obtain the definition of the Dantzig Selector given in [CT07].

2. for $\|\beta\| = \|X\beta\|_2$, we obtain the program $\operatorname*{Argmin}_{\beta \in \mathcal{DC}(s)} \|X\beta\|_2$. It was proved
in [OPT00], for example, that a particular solution of this program is Tibshirani's
LASSO estimator [Tib96], known as

$$\hat\beta^L_s = \operatorname*{Argmin}_{\beta \in \mathbb{R}^p} \left[ \|Y - X\beta\|_2^2 + 2s\|\beta\|_1 \right].$$

3. for $\|\beta\| = \|X'X\beta\|_q$ with $q > 0$, it is proved in [Alq08] that the solution
coincides with the "Correlation Selector" and it does not depend on $q$.
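The link between the penalized LASSO and the constraint set $\mathcal{DC}(s)$ in the second example can be observed numerically. The sketch below solves the penalized program with a plain coordinate descent loop (a standard technique swapped in here for illustration, not an algorithm described in the paper) and then evaluates $\|X'(Y - X\beta)\|_\infty$ at the computed solution. It only checks membership in $\mathcal{DC}(s)$, not the projection property itself; the data, the seed and the value of `s` are assumptions.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, Y, s, n_sweeps=2000):
    """Coordinate descent for min_b ||Y - X b||_2^2 + 2 s ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(Y)  # residual Y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes beta_j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, s) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

rng = random.Random(1)
n, p, s = 30, 6, 5.0                               # illustrative values
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
beta_star = [3.0, 0.0, 0.0, -2.0, 0.0, 0.0]
Y = [sum(X[i][j] * beta_star[j] for j in range(p)) + rng.gauss(0.0, 1.0)
     for i in range(n)]

beta_lasso = lasso_cd(X, Y, s)
# Recompute the residual from scratch and evaluate ||X'(Y - X beta)||_inf.
resid = [Y[i] - sum(X[i][j] * beta_lasso[j] for j in range(p)) for i in range(n)]
constraint = max(abs(sum(X[i][j] * resid[i] for i in range(n))) for j in range(p))
```

Up to numerical tolerance, `constraint` does not exceed `s`, which is the optimality condition of the penalized program and places the LASSO solution inside $\mathcal{DC}(s)$.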

In the next Section, we exhibit other cases of interest and provide some theoretical 
results on the performances of the estimators. 
3 Generalized LASSO and Dantzig Selector and Sparsity Inequalities

Let $A$ be a $p \times p$ symmetric positive matrix and $P$ be a $p \times p$ invertible matrix
satisfying both $(X'X)P = A$ and $\operatorname{Ker} A = \operatorname{Ker} X$. The idea is that, for a well chosen
$\|\cdot\|$, we will build estimators that will be useful to estimate $\beta^*$ when $P\beta^*$ is sparse,
which means that they will be close to $\beta^*$ in the sense of the semi-norm induced by
$A$.

Let $A^{-1}$ be any pseudo-inverse of $A$. Let $\Omega = (X'X)A^{-1}(X'X)/n$ (note that $\Omega$
is uniquely defined, even if $A^{-1}$ is not).

Definition 3.1. We define the "Generalized Dantzig Selector", $\hat\beta^{GDS}_s$, as $\hat\beta^s_{\|\cdot\|}$ for
$\|b\| = \|Pb\|_1$, and the "Generalized LASSO", $\hat\beta^{GL}_s$, as $\hat\beta^s_{\|\cdot\|}$ for $\|b\| = (b'Ab)^{1/2}$.

Remark 1. In the case where the program $\min_{\beta \in \mathcal{DC}(s)} \beta'A\beta$ has multiple solutions, we
define $\hat\beta^{GL}_s$ as one of the solutions that minimizes $\|P\beta\|_1$ among all the solutions $\beta$.
The case where the program $\min_{\beta \in \mathcal{DC}(s)} \|P\beta\|_1$ has multiple solutions does not cause
any trouble: we can take $\hat\beta^{GDS}_s$ as any of these solutions without any effect on its
statistical properties.

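We now present the assumptions we need to state the Sparsity Inequalities. Note
that they essentially involve the matrix $\Omega$, and thus the matrices $X$ and $A$.

Assumption A(c) for $c > 0$: for any $\alpha \in \mathbb{R}^p$ such that

$$\sum_{j : (P\beta^*)_j = 0} |\alpha_j| \le 3 \sum_{j : (P\beta^*)_j \neq 0} |\alpha_j|$$

we have

$$\sum_{j : (P\beta^*)_j \neq 0} \alpha_j^2 \le c\, \alpha'\Omega\alpha.$$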

This assumption can be seen as a modification of assumptions that can be found
in [BRT09]: in [BRT09], the same assumption is made on the matrix $X'X$. So here,
in the case where $A = X'X$ and $P = I_p$, we obtain exactly the same assumption
as in [BRT09].
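As a quick check of this special case, the following short derivation shows what $\Omega$ becomes when $A = X'X$ and $P = I_p$. It is a sketch that relies only on the defining property $M M^{-1} M = M$ of a pseudo-inverse $M^{-1}$ of a matrix $M$, applied to $M = X'X$.

```latex
\[
\Omega
  = \frac{(X'X)\,A^{-1}\,(X'X)}{n}
  = \frac{(X'X)\,(X'X)^{-1}\,(X'X)}{n}
  = \frac{X'X}{n},
\qquad\text{so that}\qquad
\alpha'\Omega\alpha = \frac{\|X\alpha\|_2^2}{n}.
\]
```

In this case the assumption is therefore a condition on the normalized Gram matrix $X'X/n$ alone.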

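Theorem 1. Let us take $\varepsilon \in\, ]0, 1[$ and $s = 2\sigma(2n\log(p/\varepsilon))^{1/2}$. Assume that
Assumption A(c) is satisfied for some $c > 0$. With probability at least $1 - \varepsilon$ we have
simultaneously:

$$(\hat\beta^{GDS}_s - \beta^*)' A (\hat\beta^{GDS}_s - \beta^*) \le 72\sigma^2 c \|P\beta^*\|_0 \log\left(\frac{p}{\varepsilon}\right),$$

$$\|P(\hat\beta^{GDS}_s - \beta^*)\|_1 \le 12\sqrt{2}\, \sigma c \|P\beta^*\|_0 \left(\frac{\log(p/\varepsilon)}{n}\right)^{1/2},$$

$$(\hat\beta^{GL}_s - \beta^*)' A (\hat\beta^{GL}_s - \beta^*) \le 128\sigma^2 c \|P\beta^*\|_0 \log\left(\frac{p}{\varepsilon}\right),$$

and finally

$$\|P(\hat\beta^{GL}_s - \beta^*)\|_1 \le 32\sqrt{2}\, \sigma c \|P\beta^*\|_0 \left(\frac{\log(p/\varepsilon)}{n}\right)^{1/2}.$$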

In the case $A = X'X$ and $P = I_p$, we obtain the same result as in [BRT09].
However, it is worth noting that the generalized estimators are particularly useful when $P\beta^*$ is
sparse and $\beta^*$ is not. In this case the errors of the LASSO and the Dantzig Selector
are not controlled anymore. This generalization is also of interest
when Assumption A(c) is satisfied for $\Omega$, but not satisfied if we replace $\Omega$ by
$X'X$.



4 Proofs 

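4.1 Proof of Lemma 1

We have $Y \sim \mathcal{N}(X\beta^*, \sigma^2 I_n)$, so $Y - X\beta^* \sim \mathcal{N}(0, \sigma^2 I_n)$ and finally
$X'(Y - X\beta^*) \sim \mathcal{N}(0, \sigma^2 X'X)$. Let us put $V = X'(Y - X\beta^*)$ and let $V_j$ denote the $j$-th
coordinate of $V$. Note that $X'X$ is normalized such that for any $j$, $V_j \sim \mathcal{N}(0, \sigma^2 n)$,
so $\mathbb{P}(|V_j| > s) \le \exp(-s^2/(2n\sigma^2))$. Then, by a union bound over the $p$ coordinates,
$\mathbb{P}(\|V\|_\infty > s) \le p \exp(-s^2/(2n\sigma^2))$. □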
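The Gaussian tail inequality used in this proof, $\mathbb{P}(|Z| > t) \le \exp(-t^2/2)$ for $Z \sim \mathcal{N}(0, 1)$, can be checked against the exact tail. The sketch below does so on a grid of values of $t$ using the complementary error function from the Python standard library; the grid itself is an arbitrary illustrative choice.

```python
import math

def two_sided_gaussian_tail(t):
    """Exact P(|Z| > t) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))

def tail_bound(t):
    """The bound exp(-t^2 / 2) used in the proof of Lemma 1."""
    return math.exp(-t * t / 2.0)

# V_j ~ N(0, sigma^2 n), so P(|V_j| > s) = P(|Z| > t) with t = s / (sigma sqrt(n)).
grid = [0.1 * k for k in range(0, 81)]  # t from 0 to 8
bound_holds = all(two_sided_gaussian_tail(t) <= tail_bound(t) for t in grid)
```

Taking $t = s/(\sigma\sqrt{n})$ turns the bound into $\exp(-s^2/(2n\sigma^2))$, the quantity appearing in the proof.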

4.2 Proof of Theorem 1

We use arguments from [BRT09]. From now on, we assume that the event
$\{\beta^* \in \mathcal{DC}(s/2)\} = \{\|X'(Y - X\beta^*)\|_\infty \le s/2\}$ is satisfied. According to Lemma 1, the
probability of this event is at least $1 - p\exp(-s^2/(8n\sigma^2)) = 1 - \varepsilon$, as $s = 2\sigma(2n\log(p/\varepsilon))^{1/2}$.
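The arithmetic behind this choice of $s$ can be verified directly. The sketch below computes the tuning parameter, confirms that the failure probability $p\exp(-s^2/(8n\sigma^2))$ equals $\varepsilon$, and evaluates $9s^2/n$ and $16s^2/n$, which equal the factors $72\sigma^2\log(p/\varepsilon)$ and $128\sigma^2\log(p/\varepsilon)$ of Theorem 1. The function names and numerical values are assumptions made for illustration.

```python
import math

def tuning_parameter(n, p, sigma, eps):
    """s = 2 * sigma * sqrt(2 n log(p / eps)), as in Theorem 1."""
    return 2.0 * sigma * math.sqrt(2.0 * n * math.log(p / eps))

def failure_probability(n, p, sigma, s):
    """Bound p * exp(-s^2 / (8 n sigma^2)) on P(beta* not in DC(s/2))."""
    return p * math.exp(-s ** 2 / (8.0 * n * sigma ** 2))

n, p, sigma, eps = 100, 1000, 1.5, 0.05   # illustrative values
s = tuning_parameter(n, p, sigma, eps)
fail = failure_probability(n, p, sigma, s)

# Factors multiplying c * ||P beta*||_0 in the quadratic bounds of Theorem 1.
gds_factor = 9.0 * s ** 2 / n     # equals 72 sigma^2 log(p / eps)
gl_factor = 16.0 * s ** 2 / n     # equals 128 sigma^2 log(p / eps)
```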
Proof of the results on the Generalized Dantzig Selector. 
We have 

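$$\begin{aligned}
(\hat\beta^{GDS}_s - \beta^*)' A (\hat\beta^{GDS}_s - \beta^*)
&= (\hat\beta^{GDS}_s - \beta^*)' X'X P (\hat\beta^{GDS}_s - \beta^*) \\
&\le \|X'X(\hat\beta^{GDS}_s - \beta^*)\|_\infty \, \|P(\hat\beta^{GDS}_s - \beta^*)\|_1 \\
&\le \left( \|X'(Y - X\beta^*)\|_\infty + \|X'(Y - X\hat\beta^{GDS}_s)\|_\infty \right) \|P(\hat\beta^{GDS}_s - \beta^*)\|_1 \\
&\le (s/2 + s) \, \|P(\hat\beta^{GDS}_s - \beta^*)\|_1
\end{aligned}$$

since $\hat\beta^{GDS}_s \in \mathcal{DC}(s)$ and $\{\beta^* \in \mathcal{DC}(s/2)\}$ is satisfied. By definition of $\hat\beta^{GDS}_s$,

$$\begin{aligned}
0 &\le \|P\beta^*\|_1 - \|P\hat\beta^{GDS}_s\|_1 \\
&= \sum_{(P\beta^*)_j \neq 0} |(P\beta^*)_j| - \sum_{(P\beta^*)_j \neq 0} |(P\hat\beta^{GDS}_s)_j| - \sum_{(P\beta^*)_j = 0} |(P\hat\beta^{GDS}_s)_j| \\
&\le \sum_{(P\beta^*)_j \neq 0} |(P\beta^*)_j - (P\hat\beta^{GDS}_s)_j| - \sum_{(P\beta^*)_j = 0} |(P\beta^*)_j - (P\hat\beta^{GDS}_s)_j|.
\end{aligned}$$

This means that

$$\|P(\hat\beta^{GDS}_s - \beta^*)\|_1 \le 2 \sum_{(P\beta^*)_j \neq 0} |(P\beta^*)_j - (P\hat\beta^{GDS}_s)_j|.$$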

We can summarize all that we have now: 

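$$\begin{aligned}
(\hat\beta^{GDS}_s - \beta^*)' A (\hat\beta^{GDS}_s - \beta^*)
&\le \frac{3s}{2} \|P(\hat\beta^{GDS}_s - \beta^*)\|_1 \\
&\le 3s \sum_{(P\beta^*)_j \neq 0} |(P\beta^*)_j - (P\hat\beta^{GDS}_s)_j|.
\end{aligned} \tag{2}$$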

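Let us remark that Inequality (2) implies that the vector $\alpha = P(\hat\beta^{GDS}_s - \beta^*)$ may be
used in Assumption A(c). This leads to

$$\begin{aligned}
(\hat\beta^{GDS}_s - \beta^*)' A (\hat\beta^{GDS}_s - \beta^*)
&\le 3s \sum_{(P\beta^*)_j \neq 0} |(P\beta^*)_j - (P\hat\beta^{GDS}_s)_j| \\
&\le 3s \left( \|P\beta^*\|_0 \sum_{(P\beta^*)_j \neq 0} \left[ (P\beta^*)_j - (P\hat\beta^{GDS}_s)_j \right]^2 \right)^{1/2} \\
&\le 3s \left( \|P\beta^*\|_0 \, c \, (P\hat\beta^{GDS}_s - P\beta^*)' \Omega (P\hat\beta^{GDS}_s - P\beta^*) \right)^{1/2} \\
&= 3s \left( \|P\beta^*\|_0 \, \frac{c}{n} \, (\hat\beta^{GDS}_s - \beta^*)' A (\hat\beta^{GDS}_s - \beta^*) \right)^{1/2}.
\end{aligned} \tag{3}$$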
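The final equality in (3) rests on an identity between the two quadratic forms. The short derivation below spells it out as a sketch, writing $\delta = \hat\beta^{GDS}_s - \beta^*$ (a symbol introduced here only for readability) and using $(X'X)P = A$, the symmetry of $A$ and $X'X$, and the pseudo-inverse property $A A^{-1} A = A$.

```latex
\[
(P\delta)'\,\Omega\,(P\delta)
  = \frac{1}{n}\,\delta' P'(X'X)\,A^{-1}\,(X'X)P\,\delta
  = \frac{1}{n}\,\delta' A\,A^{-1}A\,\delta
  = \frac{1}{n}\,\delta' A\,\delta .
\]
```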

As a consequence, 

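$$(\hat\beta^{GDS}_s - \beta^*)' A (\hat\beta^{GDS}_s - \beta^*) \le 9 s^2 \|P\beta^*\|_0 \frac{c}{n} = 72\sigma^2 c \|P\beta^*\|_0 \log\left(\frac{p}{\varepsilon}\right).$$

Plugging this result into Inequality (3) and using Inequality (2) again, we obtain:

$$\|P(\hat\beta^{GDS}_s - \beta^*)\|_1 \le 12\sqrt{2}\, \sigma c \|P\beta^*\|_0 \left(\frac{\log(p/\varepsilon)}{n}\right)^{1/2}.$$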
Proof of the results on the Generalized LASSO.

First, we establish an important property of the Generalized LASSO estimator. 
We prove that 

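$$\begin{aligned}
\forall \beta \in \mathbb{R}^p, \quad
&\|Y - XP\hat\beta^{GL}_s\|_2^2 + 2s\|P\hat\beta^{GL}_s\|_1 + (\hat\beta^{GL}_s)' P'(n\Omega - X'X) P \hat\beta^{GL}_s \\
&\qquad \le \|Y - XP\beta\|_2^2 + 2s\|P\beta\|_1 + \beta' P'(n\Omega - X'X) P \beta.
\end{aligned} \tag{4}$$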

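To prove Inequality (4), we write the Lagrangian of the program that defines $\hat\beta^{GL}_s$:

$$\mathcal{L}(\beta, \lambda, \mu) = \beta' A \beta + \lambda' \left[ X'(X\beta - Y) - sE \right] + \mu' \left[ X'(Y - X\beta) - sE \right],$$

where $E = (1, \ldots, 1)'$, and $\lambda$ and $\mu$ are vectors in $\mathbb{R}^p$. Moreover, for any $j$, $\lambda_j \ge 0$, $\mu_j \ge 0$
and $\lambda_j \mu_j = 0$. Any solution $\beta = \beta(\lambda, \mu)$ must satisfy

$$0 = \frac{\partial \mathcal{L}}{\partial \beta}(\beta, \lambda, \mu) = 2A\beta + X'X(\lambda - \mu),$$

and then $A\beta = (X'X)(\mu - \lambda)/2$. Note that $\lambda_j \ge 0$, $\mu_j \ge 0$ and $\lambda_j \mu_j = 0$ imply that
there is a $\gamma_j \in \mathbb{R}$ such that $\gamma_j = (\mu_j - \lambda_j)/2$ and $|\gamma_j| = (\lambda_j + \mu_j)/2$. Hence $\lambda_j = 2(\gamma_j)_-$
and $\mu_j = 2(\gamma_j)_+$, where for any $a$, $(a)_+ = \max(a; 0)$ and $(a)_- = \max(-a; 0)$. Letting
$\gamma$ denote the vector whose $j$-th component is exactly $\gamma_j$, we obtain:

$$A\beta = (X'X)\gamma. \tag{5}$$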
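The change of variables from the multipliers $(\lambda_j, \mu_j)$ to $\gamma_j$ can be sanity-checked coordinate by coordinate. The sketch below, with a hypothetical helper `multipliers_from_gamma` and arbitrary test values, rebuilds the multipliers from $\gamma_j$ and verifies the sign constraints, the complementarity condition and the two relations stated above.

```python
def multipliers_from_gamma(gamma_j):
    """Return (lambda_j, mu_j) = (2 (gamma_j)_-, 2 (gamma_j)_+)."""
    lam = 2.0 * max(-gamma_j, 0.0)
    mu = 2.0 * max(gamma_j, 0.0)
    return lam, mu

checks = []
for g in [-1.5, -0.25, 0.0, 0.75, 2.0]:
    lam, mu = multipliers_from_gamma(g)
    checks.append(
        lam >= 0.0 and mu >= 0.0          # sign constraints
        and lam * mu == 0.0               # complementarity
        and (mu - lam) / 2.0 == g         # gamma_j = (mu_j - lambda_j) / 2
        and (lam + mu) / 2.0 == abs(g)    # |gamma_j| = (lambda_j + mu_j) / 2
    )
all_ok = all(checks)
```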

Note that this also implies that: 

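$$\beta' A \beta = \beta'(X'X)\gamma = \beta' A A^{-1} (X'X) \gamma = \gamma'(X'X) A^{-1} (X'X) \gamma = n\gamma'\Omega\gamma.$$

Using these relations, the Lagrangian may be written:

$$\begin{aligned}
\mathcal{L}(\beta, \lambda, \mu)
&= n\gamma'\Omega\gamma + 2\gamma'X'Y - 2\gamma'(X'X)\beta - 2s \sum_{j=1}^p |\gamma_j| \\
&= 2\gamma'X'Y - n\gamma'\Omega\gamma - 2s\|\gamma\|_1.
\end{aligned}$$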
Note that $\lambda$ and $\mu$, and so $\gamma$, should maximize this value. Hence, $\gamma$ is to minimize

$$-2\gamma'X'Y + n\gamma'\Omega\gamma + 2s\|\gamma\|_1 + Y'Y.$$

Now, note that

$$Y'Y - 2\gamma'X'Y = \|Y - X\gamma\|_2^2 - \gamma'(X'X)\gamma$$

and then $\gamma$ also minimizes

$$\|Y - X\gamma\|_2^2 + 2s\|\gamma\|_1 + \gamma' \left[ n\Omega - (X'X) \right] \gamma.$$

Let us put $b = P^{-1}\gamma$; then $b$ is to minimize

$$\|Y - XPb\|_2^2 + 2s\|Pb\|_1 + (Pb)' \left[ n\Omega - (X'X) \right] (Pb). \tag{6}$$
Now, we know that $\hat\beta^{GL}_s$ must satisfy the relation

$$A\hat\beta^{GL}_s = X'XPb = Ab,$$

where $b$ is any minimizer of (6). But we can check that this equality implies that

$$\|Y - XPb\|_2^2 = \|Y - XP\hat\beta^{GL}_s\|_2^2$$

and

$$(Pb)' \left[ n\Omega - (X'X) \right] (Pb) = (P\hat\beta^{GL}_s)' \left[ n\Omega - (X'X) \right] (P\hat\beta^{GL}_s).$$

So, we necessarily have $\|Pb\|_1 = \|P\hat\beta^{GL}_s\|_1$ (otherwise we would have a contradiction
with Remark 1). Then $\hat\beta^{GL}_s$ is also a minimizer of (6). This proves Equation (4).

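The next step is to apply Equation (4) with $\beta = \beta^*$ to obtain

$$\begin{aligned}
&\|Y - XP\hat\beta^{GL}_s\|_2^2 + 2s\|P\hat\beta^{GL}_s\|_1 + (\hat\beta^{GL}_s)' P'(n\Omega - X'X) P \hat\beta^{GL}_s \\
&\qquad \le \|Y - XP\beta^*\|_2^2 + 2s\|P\beta^*\|_1 + (P\beta^*)'(n\Omega - X'X) P\beta^*.
\end{aligned}$$

For the sake of simplicity, we can define $\gamma = P\hat\beta^{GL}_s$ and $\gamma^* = P\beta^*$ and we obtain

$$\begin{aligned}
&\|Y - X\gamma\|_2^2 + 2s\|\gamma\|_1 + \gamma'(n\Omega - X'X)\gamma \\
&\qquad \le \|Y - X\gamma^*\|_2^2 + 2s\|\gamma^*\|_1 + (\gamma^*)'(n\Omega - X'X)\gamma^*.
\end{aligned}$$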

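Computations lead to

$$\begin{aligned}
&\|X(\gamma - \gamma^*)\|_2^2 + 2s\|\gamma\|_1 + \gamma'(n\Omega - X'X)\gamma - 2(Y - X\gamma^*)'X\gamma
 + 2(\gamma^*)'(n\Omega - X'X)(\gamma^* - \gamma) \\
&\qquad \le 2s\|\gamma^*\|_1 + (\gamma^*)'(n\Omega - X'X)\gamma^* - 2(Y - X\gamma^*)'X\gamma^*,
\end{aligned}$$

and then

$$\|X(\gamma - \gamma^*)\|_2^2 \le 2s(\|\gamma^*\|_1 - \|\gamma\|_1) + 2(Y - X\gamma^*)'X(\gamma - \gamma^*) - (\gamma^* - \gamma)'(n\Omega - X'X)(\gamma^* - \gamma).$$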
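As a consequence

$$\begin{aligned}
(\gamma^* - \gamma)' n\Omega (\gamma^* - \gamma)
&\le 2s(\|\gamma^*\|_1 - \|\gamma\|_1) + 2(Y - X\gamma^*)'X(\gamma - \gamma^*) \\
&\le 2s \sum_{j=1}^p (|\gamma^*_j| - |\gamma_j|) + 2\|X'(Y - X\gamma^*)\|_\infty \sum_{j=1}^p |\gamma_j - \gamma^*_j| \\
&\le 2s \sum_{j=1}^p (|\gamma^*_j| - |\gamma_j|) + s \sum_{j=1}^p |\gamma_j - \gamma^*_j|.
\end{aligned}$$

So we obtain

$$\begin{aligned}
(\gamma^* - \gamma)' n\Omega (\gamma^* - \gamma) + s \sum_{j=1}^p |\gamma_j - \gamma^*_j|
&\le 2s \sum_{j=1}^p (|\gamma^*_j| - |\gamma_j|) + 2s \sum_{j=1}^p |\gamma_j - \gamma^*_j| \\
&= 2s \sum_{j : \gamma^*_j \neq 0} (|\gamma^*_j| - |\gamma_j|) + 2s \sum_{j : \gamma^*_j \neq 0} |\gamma_j - \gamma^*_j| \\
&\le 4s \sum_{j : \gamma^*_j \neq 0} |\gamma_j - \gamma^*_j|.
\end{aligned} \tag{7}$$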

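In particular, Equation (7) implies that

$$\sum_{j : \gamma^*_j = 0} |\gamma_j - \gamma^*_j| \le 3 \sum_{j : \gamma^*_j \neq 0} |\gamma_j - \gamma^*_j|,$$

and so $\alpha = \gamma - \gamma^*$ may be used in Assumption A(c). Then Inequality (7) becomes

$$\begin{aligned}
(\gamma^* - \gamma)' n\Omega (\gamma^* - \gamma)
&\le 4s \sum_{j : \gamma^*_j \neq 0} |\gamma_j - \gamma^*_j|
\le 4s \left( \|\gamma^*\|_0 \sum_{j : \gamma^*_j \neq 0} (\gamma_j - \gamma^*_j)^2 \right)^{1/2} \\
&\le 4s \left( \|\gamma^*\|_0 \, c \, (\gamma^* - \gamma)' \Omega (\gamma^* - \gamma) \right)^{1/2}.
\end{aligned}$$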

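That leads to

$$(\hat\beta^{GL}_s - \beta^*)' A (\hat\beta^{GL}_s - \beta^*) = (\gamma^* - \gamma)' n\Omega (\gamma^* - \gamma) \le 128\sigma^2 c \|P\beta^*\|_0 \log\left(\frac{p}{\varepsilon}\right).$$

We plug this result into Inequality (7) again to obtain

$$\|P(\hat\beta^{GL}_s - \beta^*)\|_1 \le 32\sqrt{2}\, \sigma c \|P\beta^*\|_0 \left(\frac{\log(p/\varepsilon)}{n}\right)^{1/2}.$$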

This ends the proof. □ 